
SDForest
Here, we show the functionality of an SDForest and how you can use it to screen for causal parents of a continuous response in a large set of observed covariates, even in the presence of hidden confounding. We also provide methods to assess the functional partial dependence of the response on the causal parents.
We show the functionalities with simulated data from a confounded
model with two true causal parents of \(Y\), see
simulate_data_nonlinear().
vignette("realWorld") shows an example using real data.
library(SDForest)
set.seed(42)
# simulation of confounded data
q <- 2 # dimension of confounding
p <- 350 # number of observed covariates
n <- 300 # number of observations
m <- 2 # number of causal parents in covariates
sim_data <- simulate_data_nonlinear(q = q, p = p, n = n, m = m)
X <- sim_data$X
Y <- sim_data$Y
train_data <- data.frame(X, Y)
# causal parents of y
f_1 <- apply(X, 1, function(x) f_four(x, sim_data$beta, sim_data$j[1]))
f_2 <- apply(X, 1, function(x) f_four(x, sim_data$beta, sim_data$j[2]))
# functional dependence on causal parents
par(mfrow = c(1, 2))
plot(x = X[, sim_data$j[1]], f_1, xlab = paste0('X_', sim_data$j[1]))
plot(x = X[, sim_data$j[2]], f_2, xlab = paste0('X_', sim_data$j[2]))
We first estimate the random forest using the spectral objective:
\[\hat{f} = \text{argmin}_{f' \in \mathcal{F}} \frac{||Q(\mathbf{Y} - f'(\mathbf{X}))||_2^2}{n}\] where \(Q\) is a spectral transformation. (Ćevid, Bühlmann, and Meinshausen 2020), (Ulmer?)
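To make the role of \(Q\) concrete, here is a minimal sketch of the trim transform proposed as the default spectral transformation in Ćevid, Bühlmann, and Meinshausen (2020): singular values of \(\mathbf{X}\) above their median are shrunk down to the median, and the same rotation is applied to \(\mathbf{Y}\). This is for illustration only; SDForest() constructs \(Q\) internally, and its exact defaults may differ.

```r
# illustrative trim transform (a sketch, not the package internals):
# shrink the singular values of X that exceed the median to the median
sv <- svd(X)
tau <- median(sv$d)               # trim threshold
shrink <- pmin(1, tau / sv$d)     # shrinkage factor per singular value
Q <- diag(n) - sv$u %*% diag(1 - shrink) %*% t(sv$u)

# spectrally transformed data, as used in the objective above
QY <- Q %*% Y
QX <- Q %*% X
```

After the transformation, no singular value of \(Q\mathbf{X}\) exceeds the trim threshold, which limits how much a dense confounding signal can dominate the fit.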
fit <- SDForest(Y ~ ., train_data)
fit
#> SDForest result
#>
#> Number of trees: 100
#> Number of covariates: 350
#> OOB loss: 3.9
#> OOB spectral loss: 0.08

Causal parents
Given the estimated causal function, the first question we might want
to answer is which of the covariates are the causal parents of the
response. If we intervene on the causal parents, we expect the response
to change. For that, we can estimate the functional dependency of \(Y\) on \(X\), \(\widehat{f(X)}\) and examine which
covariates are important in this function. We can directly compare the
importance pattern of the deconfounded estimator to the classical random
forest estimated by ranger. Comparing against the plain counterpart gives a sense of how strong the confounding is: if there is no confounding, SDForest() and
ranger::ranger() should result in similar models.
In the graph below, we plot the variable importance
varImp() of the deconfounded random forest against that of the
plain random forest. The absolute scale carries no meaning, but the two
true causal parents (in red) receive a clearly higher variable
importance from the SDForest, while the plain random forest cannot
distinguish spurious correlation from true causation.
# comparison to classical random forest
fit_ranger <- ranger::ranger(Y ~ ., train_data, importance = 'impurity')
# comparison of variable importance
imp_ranger <- fit_ranger$variable.importance
imp_sdf <- fit$var_importance
imp_col <- rep('black', length(imp_ranger))
imp_col[sim_data$j] <- 'red'
plot(imp_ranger, imp_sdf, col = imp_col, pch = 20,
xlab = 'ranger', ylab = 'SDForest',
main = 'Variable Importance')
So far, we looked at the variable importance of the non-regularized
SDForest. Two further techniques help us understand which
variables might be causally important. The first is the regularization
path regPath(), where we plot the variable importance
against varying regularization strength. Visualizing these paths, optionally
interactively, helps us see which covariates retain robust
importance in the model.
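A minimal sketch of this workflow, assuming the regPath() interface described above (the plotly argument for the interactive view is our assumption and may differ across package versions):

```r
# variable importance along a grid of regularization strengths
paths <- regPath(fit)
plot(paths)                    # importance paths against regularization
# plot(paths, plotly = TRUE)   # interactive version (assumed option)
```

Covariates whose importance stays high as regularization increases are the most plausible causal-parent candidates.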